Rochester
EEG-GRAPH: A Factor-Graph-Based Model for Capturing Spatial, Temporal, and Observational Relationships in Electroencephalograms
Yogatheesan Varatharajah, Min Jin Chong, Krishnakant Saboo, Brent Berry, Benjamin Brinkmann, Gregory Worrell, Ravishankar Iyer
This paper presents a probabilistic-graphical model that can be used to infer characteristics of instantaneous brain activity by jointly analyzing spatial and temporal dependencies observed in electroencephalograms (EEG). Specifically, we describe a factor-graph-based model with customized factor-functions defined based on domain knowledge, to infer pathologic brain activity with the goal of identifying seizure-generating brain regions in epilepsy patients. We utilize an inference technique based on the graph-cut algorithm to exactly solve graph inference in polynomial time. We validate the model by using clinically collected intracranial EEG data from 29 epilepsy patients to show that the model correctly identifies seizure-generating brain regions. Our results indicate that our model outperforms two conventional approaches used for seizure-onset localization (5-7% better AUC: 0.72, 0.67, 0.65) and that the proposed inference technique provides 3-10% gain in AUC ( 0.72, 0.62, 0.69) compared to sampling-based alternatives.
Patient-Centered Summarization Framework for AI Clinical Summarization: A Mixed-Methods Design
Jimenez, Maria Lizarazo, Claros, Ana Gabriela, Green, Kieran, Toro-Tobon, David, Larios, Felipe, Asthana, Sheena, Wenczenovicz, Camila, Maldonado, Kerly Guevara, Vilatuna-Andrango, Luis, Proano-Velez, Cristina, Bandi, Satya Sai Sri, Bagewadi, Shubhangi, Branda, Megan E., Zahidy, Misk Al, Luz, Saturnino, Lapata, Mirella, Brito, Juan P., Ponce-Ponte, Oscar J.
Large Language Models (LLMs) are increasingly demonstrating the potential to reach human-level performance in generating clinical summaries from patient-clinician conversations. However, these summaries often focus on patients' biology rather than their preferences, values, wishes, and concerns. To achieve patient-centered care, we propose a new standard for Artificial Intelligence (AI) clinical summarization tasks: Patient-Centered Summaries (PCS). Our objective was to develop a framework to generate PCS that capture patient values and ensure clinical utility and to assess whether current open-source LLMs can achieve human-level performance in this task. We used a mixed-methods process. Two Patient and Public Involvement groups (10 patients and 8 clinicians) in the United Kingdom participated in semi-structured interviews exploring what personal and contextual information should be included in clinical summaries and how it should be structured for clinical use. Findings informed annotation guidelines used by eight clinicians to create gold-standard PCS from 88 atrial fibrillation consultations. Sixteen consultations were used to refine a prompt aligned with the guidelines. Five open-source LLMs (Llama-3.2-3B, Llama-3.1-8B, Mistral-8B, Gemma-3-4B, and Qwen3-8B) generated summaries for 72 consultations using zero-shot and few-shot prompting, evaluated with ROUGE-L, BERTScore, and qualitative metrics. Patients emphasized lifestyle routines, social support, recent stressors, and care values. Clinicians sought concise functional, psychosocial, and emotional context. The best zero-shot performance was achieved by Mistral-8B (ROUGE-L 0.189) and Llama-3.1-8B (BERTScore 0.673); the best few-shot by Llama-3.1-8B (ROUGE-L 0.206, BERTScore 0.683). Completeness and fluency were similar between experts and models, while correctness and patient-centeredness favored human PCS.
Longitudinal and Multimodal Recording System to Capture Real-World Patient-Clinician Conversations for AI and Encounter Research: Protocol
Zahidy, Misk Al, Maldonado, Kerly Guevara, Andrango, Luis Vilatuna, Proano, Ana Cristina, Claros, Ana Gabriela, Jimenez, Maria Lizarazo, Toro-Tobon, David, Montori, Victor M., Ponce-Ponte, Oscar J., Brito, Juan P.
The promise of AI in medicine depends on learning from data that reflect what matters to patients and clinicians. Most existing models are trained on electronic health records (EHRs), which capture biological measures but rarely patient-clinician interactions. These relationships, central to care, unfold across voice, text, and video, yet remain absent from datasets. As a result, AI systems trained solely on EHRs risk perpetuating a narrow biomedical view of medicine and overlooking the lived exchanges that define clinical encounters. Our objective is to design, implement, and evaluate the feasibility of a longitudinal, multimodal system for capturing patient-clinician encounters, linking 360 degree video/audio recordings with surveys and EHR data to create a dataset for AI research. This single site study is in an academic outpatient endocrinology clinic at Mayo Clinic. Adult patients with in-person visits to participating clinicians are invited to enroll. Encounters are recorded with a 360 degree video camera. After each visit, patients complete a survey on empathy, satisfaction, pace, and treatment burden. Demographic and clinical data are extracted from the EHR. Feasibility is assessed using five endpoints: clinician consent, patient consent, recording success, survey completion, and data linkage across modalities. Recruitment began in January 2025. By August 2025, 35 of 36 eligible clinicians (97%) and 212 of 281 approached patients (75%) had consented. Of consented encounters, 162 (76%) had complete recordings and 204 (96%) completed the survey. This study aims to demonstrate the feasibility of a replicable framework for capturing the multimodal dynamics of patient-clinician encounters. By detailing workflows, endpoints, and ethical safeguards, it provides a template for longitudinal datasets and lays the foundation for AI models that incorporate the complexity of care.
Opportunistic Screening for Pancreatic Cancer using Computed Tomography Imaging and Radiology Reports
Le, David, Correa-Medero, Ramon, Tariq, Amara, Patel, Bhavik, Yano, Motoyo, Banerjee, Imon
Pancreatic ductal adenocarcinoma (PDAC) is a highly aggressive cancer, with most cases diagnosed at stage IV and a five-year overall survival rate below 5%. Early detection and prognosis modeling are crucial for improving patient outcomes and guiding early intervention strategies. In this study, we developed and evaluated a deep learning fusion model that integrates radiology reports and CT imaging to predict PDAC risk. The model achieved a concordance index (C-index) of 0.6750 (95% CI: 0.6429, 0.7121) and 0.6435 (95% CI: 0.6055, 0.6789) on the internal and external dataset, respectively, for 5-year survival risk estimation. Kaplan-Meier analysis demonstrated significant separation (p<0.0001) between the low and high risk groups predicted by the fusion model. These findings highlight the potential of deep learning-based survival models in leveraging clinical and imaging data for pancreatic cancer.
LLM-Match: An Open-Sourced Patient Matching Model Based on Large Language Models and Retrieval-Augmented Generation
Li, Xiaodi, Chowdhury, Shaika, Wi, Chung Il, Vassilaki, Maria, Liu, Ken, Sio, Terence T, Garrick, Owen, Juhn, Young J, Cerhan, James R, Tao, Cui, Zong, Nansu
Patient matching is the process of linking patients to appropriate clinical trials by accurately identifying and matching their medical records with trial eligibility criteria. We propose LLM-Match, a novel framework for patient matching leveraging fine-tuned open-source large language models. Our approach consists of four key components. First, a retrieval-augmented generation (RAG) module extracts relevant patient context from a vast pool of electronic health records (EHRs). Second, a prompt generation module constructs input prompts by integrating trial eligibility criteria (both inclusion and exclusion criteria), patient context, and system instructions. Third, a fine-tuning module with a classification head optimizes the model parameters using structured prompts and ground-truth labels. Fourth, an evaluation module assesses the fine-tuned model's performance on the testing datasets. We evaluated LLM-Match on four open datasets - n2c2, SIGIR, TREC 2021, and TREC 2022 - using open-source models, comparing it against TrialGPT, Zero-Shot, and GPT-4-based closed models. LLM-Match outperformed all baselines.
Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare
Ahsan, Hiba, Sharma, Arnab Sen, Amir, Silvio, Bau, David, Wallace, Byron C.
We know from prior work that LLMs encode social biases, and that this manifests in clinical tasks. In this work we adopt tools from mechanistic interpretability to unveil sociodemographic representations and biases within LLMs in the context of healthcare. Specifically, we ask: Can we identify activations within LLMs that encode sociodemographic information (e.g., gender, race)? We find that gender information is highly localized in middle MLP layers and can be reliably manipulated at inference time via patching. Such interventions can surgically alter generated clinical vignettes for specific conditions, and also influence downstream clinical predictions which correlate with gender, e.g., patient risk of depression. We find that representation of patient race is somewhat more distributed, but can also be intervened upon, to a degree. To our knowledge, this is the first application of mechanistic interpretability methods to LLMs for healthcare.
Position Paper: Building Trust in Synthetic Data for Clinical AI
Babu, Krishan Agyakari Raja, Mulay, Supriti, Prabhu, Om, Sivaprakasam, Mohanasankar
Deep generative models and synthetic medical data have shown significant promise in addressing key challenges in healthcare, such as privacy concerns, data bias, and the scarcity of realistic datasets. While research in this area has grown rapidly and demonstrated substantial theoretical potential, its practical adoption in clinical settings remains limited. Despite the benefits synthetic data offers, questions surrounding its reliability and credibility persist, leading to a lack of trust among clinicians. This position paper argues that fostering trust in synthetic medical data is crucial for its clinical adoption. It aims to spark a discussion on the viability of synthetic medical data in clinical practice, particularly in the context of current advancements in AI. We present empirical evidence from brain tumor segmentation to demonstrate that the quality, diversity, and proportion of synthetic data directly impact trust in clinical AI models. Our findings provide insights to improve the deployment and acceptance of synthetic data-driven AI systems in real-world clinical workflows.